Processing dynamic data within an adaptive oracle-trained learning system using curated training data for incremental re-training of a predictive model

ABSTRACT

In general, embodiments of the present invention provide systems, methods and computer readable media for an adaptive oracle-trained learning framework for automatically building and maintaining models that are developed using machine learning algorithms. In embodiments, the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model. Once a model is trained, the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 14/578,205, titled “PROCESSING DYNAMIC DATA WITHIN AN ADAPTIVE ORACLE-TRAINED LEARNING SYSTEM USING CURATED TRAINING DATA FOR INCREMENTAL RE-TRAINING OF A PREDICTIVE MODEL,” filed Dec. 19, 2014, which claims the benefit of U.S. Provisional Application No. 61/920,251, entitled “PROCESSING DYNAMIC DATA USING AN ADAPTIVE CROWD-TRAINED LEARNING SYSTEM,” filed Dec. 23, 2013, and of U.S. Provisional Application No. 62/069,692, entitled “CURATING TRAINING DATA FOR INCREMENTAL RE-TRAINING OF A PREDICTIVE MODEL,” filed Oct. 28, 2014, the entire contents of which are incorporated herein by reference in their entirety.

FIELD

Embodiments of the invention relate, generally, to an adaptive system for building and maintaining machine learning models.

BACKGROUND

A system that automatically identifies new businesses based on data sampled from a data stream representing data collected from a variety of online sources (e.g., websites, blogs, and social media) is an example of a system that processes dynamic data. Analysis of such dynamic data typically is based on data-driven models that depend on consistent data, yet dynamic data are inherently inconsistent in both content and quality.

Current methods for building and maintaining models that process dynamic data exhibit a plurality of problems that make current systems insufficient, ineffective and/or the like. Through applied effort, ingenuity, and innovation, solutions to improve such methods have been realized and are described in connection with embodiments of the present invention.

SUMMARY

In general, embodiments of the present invention provide herein systems, methods and computer readable media for building and maintaining machine learning models that process dynamic data.

Data quality fluctuations may affect the performance of a data-driven model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the model may have to be replaced by a different model that more closely fits the changed data. Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive. Typically, high-quality training data instances are data that accurately represent the task being modeled, and that have been verified and labeled by at least one reliable source of truth (an oracle, hereinafter) to ensure their accuracy.

There is a declarative framework/architecture for clear definition of the end goal for the output data. The framework enables end-users to declare exactly what they want (i.e., high-quality data) without having to understand how to produce such data. Once a model has been derived from an initial training data set, being able to perform real time monitoring of the performance of the model as well as to perform data quality assessments on dynamic data as it is being collected can enable updating of the training data set so that the model may be adapted incrementally to fluctuations of quality and/or statistical distribution of dynamic data. Incremental adaptation of a model reduces the costs involved in repeatedly replacing the model.

As such, and according to some example embodiments, the systems and methods described herein are therefore configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining machine learning models that are developed using machine learning algorithms. In embodiments, the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model. Once a model is trained, the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.

The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)

Having thus described the invention in general terms, reference will now be made to the accompanying drawings, which are not necessarily drawn to scale, and wherein:

FIG. 1A illustrates a first embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;

FIG. 1B illustrates a second embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework that is further configured to include a training data manager component for curating the training data used to train and/or re-train the predictive model 130 in accordance with some embodiments discussed herein;

FIG. 2 is a flow diagram of an example method for automatically generating an initial predictive model and a high-quality training data set used to derive the model within an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;

FIG. 3 illustrates an exemplary process for automatically determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set using an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;

FIG. 4 is a flow diagram of an example method for determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set in accordance with some embodiments discussed herein;

FIG. 5 is a flow diagram of an example method 500 for adaptive processing of input data by an adaptive learning framework in accordance with some embodiments discussed herein;

FIG. 6 illustrates a third embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining a predictive machine learning model in accordance with some embodiments discussed herein;

FIG. 7 is a flow diagram of an example method for adaptive maintenance of a predictive model for optimal processing of dynamic data in accordance with some embodiments discussed herein;

FIG. 8 is a flow diagram of an example method for dynamically updating a model core group of clusters along a single dimension k in accordance with some embodiments discussed herein;

FIG. 9 is a flow diagram of an example method for dynamically updating a cluster along a single dimension k in accordance with some embodiments discussed herein;

FIG. 10 illustrates a diagram in which an exemplary dynamic data quality assessment system is configured as a quality assurance component within an adaptive oracle-trained learning framework in accordance with some embodiments discussed herein;

FIG. 11 is a flow diagram of an example method for automatic dynamic data quality assessment of dynamic input data being analyzed using an adaptive predictive model in accordance with some embodiments discussed herein;

FIG. 12 is a flow diagram of an example method for using active learning for processing potential training data for a machine-learning algorithm in accordance with some embodiments discussed herein;

FIG. 13 is an illustration of various different effects of active learning and dynamic data quality assessment on selection of new data samples to be added to an exemplary training data set for a binary classification model in accordance with some embodiments discussed herein; and

FIG. 14 illustrates a schematic block diagram of circuitry that can be included in a computing device, such as an adaptive learning system, in accordance with some embodiments discussed herein.

DETAILED DESCRIPTION

The present invention now will be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the inventions are shown. Indeed, these inventions may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like numbers refer to like elements throughout.

As described herein, system components can be communicatively coupled to one or more of each other. Though the components are described as being separate or distinct, two or more of the components may be combined into a single process or routine. The component functional descriptions provided herein, including separation of responsibility for distinct functions, are by way of example. Other groupings or other divisions of functional responsibilities can be made as necessary or in accordance with design preferences.

As used herein, the terms “data,” “content,” “information” and similar terms may be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure. Further, where a computing device is described herein to receive data from another computing device, the data may be received directly from the another computing device or may be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like. Similarly, where a computing device is described herein to send data to another computing device, the data may be sent directly to the another computing device or may be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.

Data being continuously sampled from a data stream representing data collected from a variety of online sources (e.g., websites, blogs, and social media) is an example of dynamic data. A system that automatically performs email fraud identification based on data sampled from a data stream is an example of a system that processes dynamic data. Analysis of such dynamic data typically is based on data-driven models that can be generated using machine learning. One type of machine learning is supervised learning, in which a statistical predictive model is derived based on a training data set of examples representing the modeling task to be performed.

The statistical distribution of the set of training data instances should be an accurate representation of the distribution of data that will be input to the model for processing. Additionally, the composition of a training data set should be structured to provide as much information as possible to the model. However, dynamic data is inherently inconsistent. The quality of the data sources may vary, the quality of the data collection methods may vary, and, in the case of data being collected continuously from a data stream, the overall quality and statistical distribution of the data itself may vary over time.

Data quality fluctuations may affect the performance of a data-driven model, and, in some cases when the data quality and/or statistical distribution of the data has changed over time, the model may have to be replaced by a different model that more closely fits the changed data. Obtaining a set of accurately distributed, high-quality training data instances for derivation of a model is difficult, time-consuming, and/or expensive. Typically, high-quality training data instances are data that accurately represent the task being modeled, and that have been verified and labeled by at least one oracle to ensure their accuracy. Once a model has been derived from an initial training data set, being able to perform real time monitoring of the performance of the model as well as to perform data quality assessments on dynamic data as it is being collected can enable updating of the training data set so that the model may be adapted incrementally to fluctuations of quality and/or statistical distribution of dynamic data. Incremental adaptation of a model reduces the costs involved in repeatedly replacing the model.

As such, and according to some example embodiments, the systems and methods described herein are therefore configured to implement an adaptive oracle-trained learning framework for automatically building and maintaining models that are developed using machine learning algorithms. In embodiments, the framework leverages at least one oracle (e.g., a crowd) for automatic generation of high-quality training data to use in deriving a model. Once a model is trained, the framework monitors the performance of the model and, in embodiments, leverages active learning and the oracle to generate feedback about the changing data for modifying training data sets while maintaining data quality to enable incremental adaptation of the model.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. The framework is designed to provide high-quality data at lower cost than current state-of-the-art machine learning algorithms/processes across many real-world data sets. No initial training/testing phase is needed to generate a model. No expert human involvement is needed to initially construct and over time maintain the training set and retrain the model. The framework continues to provide high-quality output data even if the input data change, since the framework determines how and when to adjust the training data set for incremental re-training of the model, and the framework can rely on verified data from an oracle (e.g., crowd sourced data) while the model is being re-trained. The framework has the ability to utilize any high-quality/oracle-provided data, regardless of how the data was generated (e.g., the framework can make use of data that was not collected as part of the training process, such as a separate process in an organization using an oracle to collect correct categories for business).

There is a declarative framework/architecture for clear definition of the end goal for the output data. The framework enables end-users to declare exactly what they want (i.e., high-quality data) without having to understand how to produce such data. The system takes care of not only training the model transparently (as described above), but also deciding for every input data instance if the system should get the answer from the oracle or from a model. All of the details of machine learning models and the accessing of an oracle (e.g., crowd-sourcing) are hidden from the user; the system may not even utilize a full-scale machine learning model or an oracle as long as it can meet its quality requirements.

FIG. 1A illustrates a first embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 100 for automatically building and maintaining a predictive machine learning model. In embodiments, an adaptive oracle-trained learning framework 100 comprises a predictive model 130 (e.g., a classifier) that has been generated using machine learning based on a set of training data 120, and that is configured to generate a judgment about unlabeled input data 105 in response to receiving a feature representation of the input data 105; an input data analysis component 110 for generating a feature representation of the input data 105; an accuracy assessment component 135 for providing an estimated assessment of the accuracy of the judgment of the input data and/or the quality of the input data 105; an active labeler 140 to facilitate the generation and maintenance of optimized training data 120 by identifying possible updates to the training data 120; at least one oracle 150 (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) for providing a verified true label for input data 105 identified by the active labeler 140; a labeled data reservoir 155 for storing input data 105 that have received true labels from the oracle 150; and an accuracy assurance component 160 for determining whether the system output processed data 165 satisfies an accuracy threshold.

In embodiments, the predictive model 130 is a trainable model that is derived from the training data 120 using supervised learning. An exemplary trainable model (e.g., a trainable classifier) is adapted to represent a particular task (e.g., a binary classification task in which a classifier model returns a judgment as to which of two groups an input data instance 105 most likely belongs) using a set of training data 120 that consists of examples of the task being modeled. Referring to the exemplary binary classification task, each training example in a training data set from which the classifier is derived may represent an input to the classifier that is labeled with the group to which the input data instance belongs.

Supervised learning is considered to be a data-driven process, because the efficiency and accuracy of deriving a model from a set of training data is dependent on the quality and composition of the set of training data. As discussed previously, obtaining a set of accurately distributed, high-quality training data instances typically is difficult, time-consuming, and/or expensive. For example, the training data set examples for a classification task should be balanced to ensure that all class labels are adequately represented in the training data. Credit card fraud detection is an example of a classification task in which examples of fraudulent transactions may be rare in practice, and thus verified instances of these examples are more difficult to collect for training data.

In some embodiments, an initial predictive model and a high-quality training data set used to derive the model via supervised learning may be generated automatically within an adaptive oracle-trained learning framework (e.g., framework 100) by processing a stream of unlabeled dynamic data.

FIG. 2 is a flow diagram of an example method 200 for automatically generating an initial predictive model and a high-quality training data set used to derive the model within an adaptive oracle-trained learning framework. For convenience, the method 200 will be described with respect to a system that includes one or more computing devices and performs the method 200. Specifically, the method 200 will be described with respect to processing of dynamic data by an adaptive oracle-trained learning framework 100.

In embodiments, a framework 100 is configured initially 205 to include an untrained predictive model 130 and an empty training data set 120. In some embodiments, at framework setup, the framework 100 is assigned 210 an input configuration parameter describing a desired accuracy A for processed data 165 to be output from the framework 100. In some embodiments, the desired accuracy A may be a minimum accuracy threshold to be satisfied for each processed data instance 165 to be output from the framework while, in some alternative embodiments, the desired accuracy A may be an average accuracy to be achieved for a set of processed data 165. The values chosen to describe the desired accuracy A for sets of processed data across various embodiments may vary.
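By way of illustration only, the two interpretations of the desired accuracy A described above (a per-instance minimum versus an average over a set) may be sketched as follows. The function name, the `mode` parameter, and the representation of accuracy estimates as floating point values between 0 and 1 are assumptions made for this sketch and are not part of the described embodiments.

```python
from statistics import mean


def meets_desired_accuracy(accuracy_estimates, desired_accuracy, mode="per_instance"):
    """Check estimated accuracies against the desired accuracy A.

    mode="per_instance": every processed data instance must reach A.
    mode="average":      the set of processed data must reach A on average.
    Both the name and the mode switch are hypothetical.
    """
    if not accuracy_estimates:
        return False  # nothing has been processed yet
    if mode == "per_instance":
        return min(accuracy_estimates) >= desired_accuracy
    if mode == "average":
        return mean(accuracy_estimates) >= desired_accuracy
    raise ValueError(f"unknown mode: {mode!r}")
```

Under this sketch, a set containing one weak instance may fail the per-instance interpretation while still satisfying the average interpretation.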

In some embodiments, an initially configured adaptive oracle-trained learning framework 100 that includes an untrained model and empty training data set may be “cold started” 215 by streaming unlabeled input data instances 105 into the system for processing. The model 130 and training data 120 are then adaptively updated 230 by the framework 100 until the processed data instances 165 produced by the model 130 consistently achieve 225 the desired accuracy A as specified by the single input configuration parameter (i.e., the process ends 235 when the system reaches a “steady state”).

In some alternative embodiments, one or more high-quality initial training data sets may be generated automatically from a pool of unlabeled data instances. In some embodiments, the unlabeled data instances are dynamic data that have been collected previously from at least one data stream during at least one time window. In some embodiments, the collected data instances are multi-dimensional data, where each data instance is assumed to be described by a set of attributes (i.e., features hereinafter). In some embodiments, the input data analysis component 110 performs a distribution-based feature analysis of the collected data. In some embodiments, the feature analysis includes clustering the collected data instances into homogeneous groups across multiple dimensions using an unsupervised learning approach that is dependent on the distribution of the input data as described, for example, in U.S. patent application Ser. No. 14/038,661 entitled “Dynamic Clustering for Streaming Data,” filed on Sep. 16, 2013, and which is incorporated herein in its entirety. In some embodiments, the clustered data instances are sampled uniformly across the different homogeneous groups, and the sampled data instances are sent to an oracle 150 (as shown in FIG. 1A) for labeling.
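As a minimal sketch of sampling uniformly across homogeneous groups, the following example draws the same number of instances from each cluster before the sample would be sent to an oracle. The callable `cluster_of` is a hypothetical stand-in for whatever clustering assigns an instance to its group (the dynamic clustering of the referenced application is not reproduced here), and `per_cluster` and `seed` are illustrative parameters.

```python
import random
from collections import defaultdict


def sample_uniformly_across_clusters(instances, cluster_of, per_cluster, seed=0):
    """Draw up to `per_cluster` instances from every cluster.

    `cluster_of` maps an instance to a sortable cluster identifier; it is a
    placeholder for the clustering step and is an assumption of this sketch.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for instance in instances:
        groups[cluster_of(instance)].append(instance)

    sampled = []
    for cluster_id in sorted(groups):
        members = groups[cluster_id]
        # Small clusters contribute everything they have.
        sampled.extend(rng.sample(members, min(per_cluster, len(members))))
    return sampled
```

Sampling per cluster rather than from the pooled data keeps a rare group represented in the labeled set even when it makes up a small share of the collected instances.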

FIGS. 3 and 4 respectively illustrate and describe a flowchart for an exemplary method 400 for automatically determining whether an input multi-dimensional data instance is an optimal choice for labeling and inclusion in at least one initial training data set using an adaptive oracle-trained learning framework 100. The depicted method 400 is described with respect to a system that includes one or more computing devices and performs the method 400.

In embodiments, the system receives an input multi-dimensional data instance having k attributes 405. Determining whether an input multi-dimensional data instance is a preferred choice for labeling and inclusion in at least one initial training data set 420 is based in part on an operator estimation score and/or on a global estimation score assigned to the data instance.

Turning to FIG. 3 for illustration, in embodiments, an input multi-dimensional data instance having k attributes is represented by a feature vector x 305 having k elements (x_1, x_2, . . . , x_k), where each element in feature vector x represents the value of a corresponding attribute. Each of the elements is assigned to a particular cluster/distribution of the corresponding attribute using a clustering/distribution algorithm 320 (e.g., dynamic clustering as described in U.S. patent application Ser. No. 14/038,661).

In embodiments, an operator estimate 302 is calculated 410 (as shown in FIG. 4) for each feature. An operator represents a single data cleaning manipulation action applied to a feature. Each operator (e.g., normalization) has at most one statistical model to power its cleaning of the data. In some embodiments, an operator estimate 302 may include multiple operators chained together.

Using an input from a clustering/distribution algorithm 320 respectively associated with each operator estimate, a classifier 330, implementing a per operator estimator trained on the distribution, then determines a per operator estimate confidence value estimating probability P_n(x|T), a probability based on the operator estimator n that the feature vector x belongs to the cluster/distribution T of multi-dimensional data instance feature vectors to which it has been assigned. The data instance is assigned an operator estimation score representing the values of the set of per operator estimates 360. For example, referring to the exemplary binary classification task, a higher operator estimation score indicates that the data instance would be assigned to one of the two classes by a binary classifier with a greater degree of confidence/certainty because the data instance is at a greater distance from the decision boundary of the classification task. Conversely, a lower operator estimation score indicates that the assignment of the data instance to one of the classes by the binary classifier would be at a lower degree of confidence/certainty because the data instance is located close to or at the decision boundary for the classification task.

In some embodiments, the data instance, represented by feature vector x 305, is assigned to each of a group of N global datasets 310 containing data instances of the same type as the input data instance, and an estimated distribution 312 is calculated for each dataset. In some embodiments, the group of N global datasets 310 have varying timeline-based sizes (e.g., each dataset respectively represents a set of data instances collected during a weekly, monthly, or quarterly time window). Using an input from a clustering/distribution algorithm 340 respectively associated with each of the group of datasets, a classifier 350 implementing a per dataset estimator trained on each distribution determines a per dataset global estimate confidence value estimating probability P_G(x|DY), a probability that the input data instance belongs to the global distribution represented by its associated dataset Y. The input data instance is assigned 415 a global estimation score representing the values of the set of per dataset global estimates 370. A data instance having a higher global estimation score is more likely to belong to a global distribution of data instances of the same type.
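The manner in which the per operator estimates and per dataset global estimates are combined into scores, and how the two scores are weighed against each other, is not fixed by the description above. Purely as a hedged sketch, the example below assumes that each score is the arithmetic mean of its estimates and that an instance is preferred for labeling when it lies near a decision boundary (low operator estimation score) yet is typical of the global data (high global estimation score). The function names and both threshold values are hypothetical.

```python
from statistics import mean


def estimation_score(probabilities):
    """Collapse a set of per-estimator probabilities into a single score.

    Using the arithmetic mean is an assumption of this sketch.
    """
    if not probabilities:
        raise ValueError("at least one estimate is required")
    return mean(probabilities)


def is_preferred_for_labeling(operator_estimates, global_estimates,
                              max_operator_score=0.6, min_global_score=0.5):
    """Hypothetical selection rule combining both scores.

    operator_estimates: values of P_n(x|T), one per operator estimator.
    global_estimates:   values of P_G(x|DY), one per global dataset.
    The thresholds are illustrative, not values taken from the description.
    """
    operator_score = estimation_score(operator_estimates)
    global_score = estimation_score(global_estimates)
    return operator_score <= max_operator_score and global_score >= min_global_score
```

Other combinations (for example, a weighted sum, or use of only one of the two scores) would be equally consistent with the "and/or" language used above.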

Returning to FIG. 1A, once the model 130 is derived, in some embodiments, the framework 100 may further optimize the initial training data 120 by processing the training data set examples using the model 130, monitoring the performance of the model 130 during the processing, and then adjusting the input data feature representation and/or the composition and/or distribution of the training dataset based on an analysis of the model's performance.

In some embodiments, a predictive model 130 and training data 120 deployed within an adaptive oracle-trained learning framework 100 for processing dynamic data may be updated incrementally in response to changes in the quality and/or characteristics of the dynamic data to achieve optimal processing of newly received input data 105. In embodiments, an input data instance 105 may be selected by the framework as a potential training example based on an accuracy assessment determined from the model output generated from processing the input data instance 105 and/or attributes of the input data instance. Selected data instances receive true labels from at least one oracle 150, and are stored in a labeled data reservoir 155. In embodiments, the training data 120 are updated using labeled data selected from the labeled data reservoir 155.

FIG. 5 is a flow diagram of an example method 500 for adaptive processing of input data by an adaptive learning framework. The method 500 is described with respect to a system that includes one or more computing devices that process dynamic data by an adaptive oracle-trained learning framework 100. For clarity and without limitation, method 500 will be described for an exemplary system in which the predictive model 130 is a trainable classifier.

In embodiments, the system receives 505 model output (i.e., a judgment) from a classifier model (e.g., model 130) that has processed an input data instance 105. Exemplary model output may be a predicted label representing a category/class to which the input data instance is likely to belong. In some embodiments, the judgment includes a confidence value that represents the certainty of the judgment. For example, if the input data instance is very different from any of the training data instances, the model output that is generated from that input data has a low confidence. The confidence value may be defined by any well-known distance metric (e.g., Euclidean distance, cosine, Jaccard distance). In some embodiments, an associated judgment confidence value may be a confidence score.

Referring to the example in which the classification task is a binary classification task, the judgment may be based on the model performing a mapping of the input data instance feature set into a binary decision space representing the task parameters, and the associated judgment confidence value may be a confidence score representing the distance in the binary decision space between the mapping of the data instance feature set and a decision boundary at the separation of the two classes in the decision space. A mapping located at a greater distance from the decision boundary may be associated with a higher confidence score, representing a class assignment predicted at a greater confidence/certainty. Conversely, a mapping that is located close to the decision boundary may be associated with a lower confidence score, representing a class assignment predicted at a lower confidence/certainty.
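To make the relationship between distance from the decision boundary and confidence concrete, the following sketch assumes, for illustration only, a linear binary classifier whose decision boundary is the hyperplane w·x + b = 0. The description above does not limit the model to this form; the function names and the use of Euclidean distance to the hyperplane as the confidence score are assumptions.

```python
import math


def signed_distance(weights, bias, features):
    """Signed Euclidean distance from a feature vector to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        raise ValueError("weights must not all be zero")
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return activation / norm


def judge(weights, bias, features):
    """Return (predicted_label, confidence_score) for a binary task.

    The label is the side of the boundary the instance falls on; the
    confidence score is its unsigned distance from that boundary.
    """
    distance = signed_distance(weights, bias, features)
    label = 1 if distance >= 0 else 0
    return label, abs(distance)
```

In this sketch, two instances can receive the same predicted label while differing widely in confidence, which is the property the active labeling analysis relies on.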

In embodiments, the system executes 510 an accuracy assessment of the model output and/or the input data instance quality. In some embodiments, the accuracy assessment is an accuracy value representing the accuracy of the model judgment.

In some embodiments, accuracy assessment may include one or a combination of model-dependent and model-independent analytics. In some embodiments in which the model judgment includes a confidence score, accuracy assessment may include that confidence score directly. In some embodiments, a second predictive model may be used to estimate the framework model accuracy on a per-instance level. For example, a random sample of data instances labeled by the framework model can be sent to the oracle for verification, and that sample then can be used as training data to train a second model to predict the probability that the framework model judgment is correct.

In some embodiments, accuracy assessment is implemented by a quality assurance component 160 to generate an aggregate/moving window estimate of accuracy. In some embodiments, the quality assurance component 160 is configured as a dynamic data quality assessment system described, for example, in U.S. patent application Ser. No. 14/088,247 entitled “Automated Adaptive Data Analysis Using Dynamic Data Quality Assessment,” filed on Nov. 22, 2013, and which is incorporated herein in its entirety. An exemplary dynamic quality assessment system is described in detail with reference to FIG. 10 and method 700 of FIG. 7.
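One simple way to picture a moving window estimate of accuracy is shown below. The sketch assumes that the estimate is the fraction of recent oracle-verified instances on which the model judgment matched the true label; the class name, the fixed window size, and this definition of accuracy are assumptions, and the referenced dynamic data quality assessment system is not reproduced.

```python
from collections import deque


class MovingWindowAccuracy:
    """Fraction of correct model judgments over the most recent verified instances."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # A bounded deque discards the oldest outcome once the window is full.
        self._outcomes = deque(maxlen=window_size)

    def record(self, model_label, true_label):
        """Store whether the model judgment agreed with the oracle's true label."""
        self._outcomes.append(model_label == true_label)

    def estimate(self):
        """Return the windowed accuracy, or None before any instance is verified."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)
```

Because old outcomes fall out of the window, the estimate tracks recent behavior and can reveal a drop in accuracy after the input data distribution shifts.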

In embodiments, the system analyzes 515 the assessed model output and input data instance by determining whether the input data instance should be selected for potential inclusion in the training data set 120. In an instance in which the input data instance is selected 520 as a possible training example, the system sends 525 the instance to an oracle for true labeling.

In some embodiments, the analysis (“active labeling” hereinafter) includes active learning. Active learning, as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, is a semi-supervised learning process in which the distribution of the training data set instances can be adjusted to optimally represent a machine learning problem. For example, a machine-learning algorithm may achieve greater accuracy with fewer training examples if the selected training data set instances are instances that will provide maximum information to the model about the problem. Referring to the trainable classifier example, data instances that may provide maximum information about a classification task are data instances that result in mappings in decision space that are closer to the decision boundary. In some embodiments, these data instances may be identified automatically through active labeling analysis because their judgments are associated with lower confidence scores, as previously described.
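The selection of instances whose judgments carry lower confidence scores can be illustrated with a basic form of uncertainty sampling. In the sketch below, the labeling `budget` (the number of instances that may be sent to the oracle) and the pairing of each instance identifier with its confidence score are assumptions introduced for the example.

```python
def select_least_confident(judgments, budget):
    """Pick the instances the model is least sure about.

    judgments: iterable of (instance_id, confidence_score) pairs.
    budget:    hypothetical cap on how many instances go to the oracle.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Lowest confidence first, i.e. closest to the decision boundary.
    ranked = sorted(judgments, key=lambda item: item[1])
    return [instance_id for instance_id, _ in ranked[:budget]]
```

A fixed confidence threshold could be used in place of a budget; either choice selects from the same low-confidence end of the ranking.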

Additionally and/or alternatively, in some embodiments, the determination of whether the input data instance should be selected for potential inclusion in the training data set 120 may include a data quality assessment. In some embodiments, active labeling analysis may be based on a combination of model prediction accuracy and data quality.

In some embodiments, in response to receiving a labeled data instance from the oracle, the system stores 530 the labeled data instance in a labeled data reservoir 155, from which new training data instances may be selected for updates to training data 120. In some embodiments, the labeled data reservoir grows continuously as labeled data instances are received by the system and then stored.

In embodiments, the system outputs 545 the labeled data instance before the process ends 550. The true label assigned to the data instance by the oracle ensures the accuracy of the output, regardless of the outcome of the accuracy assessment of the model performance and/or the input data instance quality.

In an instance in which the input data instance is not selected 520 as a possible training example, in embodiments, the system sends 535 the assessed input data instance and the model output for accuracy assurance. In some embodiments, as previously described, accuracy assurance may include determining whether the assessed input data instance and the model output satisfy a desired accuracy A that has been received as a declarative configuration parameter by the system.

In an instance in which the desired accuracy is satisfied 540, the system outputs 545 the processed data instance and the process ends 550.

In an instance in which the desired accuracy is not satisfied 540, in embodiments, the system sends 525 the input data instance to the oracle for true labeling. In some embodiments, the labeled data instance is added 530 to the data reservoir and then output 545 before the process ends 550, as previously described.
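The branching among steps 520 through 550 can be sketched as a single routing function. All names here (`Judgment`, `process_instance`, the two predicate callbacks, and the thresholds in the usage lines) are hypothetical stand-ins; in particular, the predicates merely stand in for the active labeling analysis and the accuracy assurance check, whose internals are not specified in this passage.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    label: str
    confidence: float  # assumed: higher means farther from the boundary


def process_instance(instance, judgment, oracle, reservoir,
                     select_for_training, accuracy_ok):
    """Return the (instance, label) pair to output for one data instance."""
    # 520: selected as a possible training example, or 540: accuracy not met
    if select_for_training(instance, judgment) or not accuracy_ok(instance, judgment):
        true_label = oracle(instance)             # 525: oracle assigns true label
        reservoir.append((instance, true_label))  # 530: store in labeled reservoir
        return instance, true_label               # 545: output oracle-labeled data
    return instance, judgment.label               # 545: output model judgment


# Toy stand-ins for the oracle and the two checks (illustrative thresholds)
oracle = lambda instance: "restaurant"
select = lambda instance, j: j.confidence < 0.3
ok = lambda instance, j: j.confidence >= 0.6

reservoir = []
out_uncertain = process_instance("Joe's Diner", Judgment("spa", 0.2),
                                 oracle, reservoir, select, ok)
out_confident = process_instance("Zen Spa", Judgment("spa", 0.9),
                                 oracle, reservoir, select, ok)
```

In this sketch every instance routed to the oracle also lands in the reservoir, which matches the described behavior that the reservoir grows as labeled instances are received.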

FIG. 1B illustrates a second embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 100B that is further configured to include a training data manager component 156 for curating the training data 120 used to train and/or re-train the predictive model 130. In various embodiments, curating the training data 120 may include one or a combination of determining the composition of the training data set 120 and determining when to re-train the model 130.

In embodiments, the training data manager 156 may update the training data 120 by selecting, from the labeled data reservoir 155, an optimal subset of training data samples to use in a training data 120 update. In embodiments in which the input data instances are multi-dimensional data, the criteria used by the training data manager 156 for selecting the optimal subset of training data samples may be based at least in part on a feature analysis used to generate the initial training data set from which the model is derived, as described previously with reference to FIGS. 3-4. In some embodiments, the training data manager 156 selection criteria may be used to update the feature extraction criteria implemented by the input data analysis component 110.

In some embodiments, the data stored in the labeled data reservoir 155 are not included in the training data set 120. As previously described with reference to method 500, in some embodiments, a labeled data reservoir 155 includes a pool of possible training data that have been collected continuously over time from input data being processed by the model 130. Each of the data instances in the reservoir has been assigned a true label (e.g., a verified category identifier for a classification task) by a trusted source (i.e., an oracle). In some embodiments, the labeled reservoir data are collected as a result of having been selected, for different purposes, by one of multiple sources (e.g., the active labeler component 140 and the accuracy assurance component 160 of framework 100B). For example, referring to an exemplary classification task, active learning may select possible training data instances from input data in which the predicted judgment is close to the decision boundary (thus providing maximum information about the task to the model), while dynamic quality assessment may select possible training data instances from input data based on a statistical decision. Selection criteria for selection of data instances for labeling in a binary classification task are described in detail with reference to FIG. 13.

In embodiments, selecting a set of labeled data instances from the labeled data reservoir 155 is based on a determination that re-training the model 130 with updated training data likely will result in improved model performance. In some embodiments, this determination is based at least in part on analyzing the distribution and quality of the training data. For example, in some embodiments in which the predictive model 130 is a classifier, the selection may be based at least in part on maintenance and/or improvement of class balance in the training data (e.g., adding training examples of rare categories). In a second example, the selection may be based at least in part on adding examples having higher data quality than the training data. In a third example, the selection may be based at least in part on adding examples that have higher accuracy assessment scores, as previously described with reference to method 500. Additionally and/or alternatively, this determination may be based on feedback signals received from one or more components of the system (e.g., the active labeler component 140 and the accuracy assurance component 160) and/or data freshness (i.e., adding more newer data to a training data set than older data).
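As one illustration of the class balance criterion, the sketch below repeatedly picks the reservoir instance whose label is currently rarest in the training data. The function name `pick_for_balance`, the (instance, label) tuple format, and the rarest-first rule are assumptions; the specification names class balance as a criterion without fixing a selection rule.

```python
from collections import Counter


def pick_for_balance(training_labels, reservoir, n):
    """Pick up to n reservoir items, always favoring the rarest class so far.

    reservoir: list of (instance, label) pairs (assumed format).
    """
    counts = Counter(training_labels)  # unseen labels count as zero
    remaining = list(reservoir)
    picked = []
    while remaining and len(picked) < n:
        # Ties are broken by reservoir order because min returns the first minimum
        choice = min(remaining, key=lambda item: counts[item[1]])
        remaining.remove(choice)
        picked.append(choice)
        counts[choice[1]] += 1
    return picked


picked_examples = pick_for_balance(
    ["a"] * 5 + ["b"],
    [("x1", "a"), ("x2", "b"), ("x3", "b"), ("x4", "c")],
    n=2,
)
```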

In some embodiments, one or more candidate training data sets may be generated by updating the current training data 120 using a selected set of labeled data instances. In embodiments, updating the current training data 120 may include pruning the training data set and replacing removed data with at least a subset of the selected labeled data. In some embodiments, pruning the training data set may include removing outliers from the training data. In some embodiments in which the model is a classifier, removing outliers may be implemented on a per class basis (e.g., removing a training data sample describing a patient who has been classified as having a particular disease but has attributes that are inconsistent with the attributes describing other patients who have been classified as having that disease). Additionally and/or alternatively, updating the training data may include pruning outliers from the selected labeled data before updating the training data.
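Per class outlier removal might look like the following sketch, which assumes a single numeric feature per sample and uses a z-score cutoff. The function name, the cutoff `z_max`, and the z-score rule itself are illustrative assumptions rather than the method required by the specification.

```python
from collections import defaultdict
from statistics import mean, pstdev


def prune_outliers_per_class(samples, z_max=2.0):
    """Drop samples lying more than z_max standard deviations from their class mean.

    samples: list of (feature_value, class_label) pairs (assumed format).
    """
    by_class = defaultdict(list)
    for value, label in samples:
        by_class[label].append(value)

    kept = []
    for value, label in samples:
        values = by_class[label]
        mu, sigma = mean(values), pstdev(values)
        # A class with zero spread has no outliers by this rule
        if sigma == 0 or abs(value - mu) / sigma <= z_max:
            kept.append((value, label))
    return kept


samples = [(v, "flu") for v in [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 9.0]]
samples += [(5.0, "cold"), (5.1, "cold")]
pruned = prune_outliers_per_class(samples)
```

Because the statistics are computed within each class, the value 9.0 is judged against other "flu" samples only, which mirrors the patient example above.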

In some embodiments, multiple different candidate training data sets may be generated by updating the current training data in different ways. In some embodiments, each of the differently updated training data sets may represent respective updating of the current training data using a different subset of the selected set of labeled data instances. In some embodiments, updating the current training data may be based on a greedy algorithm in which new batches of training data instances are added incrementally to the training data set. Before each batch is added, a test is performed to determine if updating the training data by adding the batch will improve the model performance. Additionally and/or alternatively, in some embodiments, updating the current training data may be based on a non-greedy algorithm in which, for example, all the current training data are removed and replaced with a completely new set of training data.
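The greedy variant can be sketched as a loop that keeps a batch only when a performance test improves. The `evaluate` callback is hypothetical: in practice it would train a candidate model on the supplied data and score it, whereas the toy scoring function in the usage lines exists only to make the example runnable.

```python
def greedy_update(training, batches, evaluate):
    """Add each batch only if it improves the score returned by evaluate.

    evaluate: hypothetical callback mapping a training set to a score
    (higher is better), e.g. held-out accuracy of a model trained on it.
    """
    best_score = evaluate(training)
    for batch in batches:
        candidate = training + batch
        score = evaluate(candidate)
        if score > best_score:  # the test performed before a batch is kept
            training, best_score = candidate, score
    return training, best_score


# Toy score: the closer the data sum is to 10, the better
toy_score = lambda data: -abs(sum(data) - 10)
updated, final_score = greedy_update([1, 2], [[3], [20], [4]], toy_score)
```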

In some embodiments, a candidate model is derived respectively from each candidate training data set using supervised learning. In embodiments, each candidate model's performance is compared to the current model performance, and an assessment is made to determine whether the candidate model performance is improved from the current model performance. In some embodiments, generating the assessment includes A/B testing in which the same set of data is input to the current model and to at least one candidate model that has been trained using candidate training data. In some embodiments, comparing the performance of the current model and a candidate model is implemented by cross-validation. There are a variety of well-known statistical techniques for comparing results; the choice of statistical technique for comparing the performance of models is not critical to the invention.

In some embodiments in which the model performs real time analysis of input data from a datastream (e.g., embodiments of dynamic data analysis system 100B), the input datastream may be forked to multiple models so that A/B testing is implemented in parallel for all the models. In embodiments, the updated training data and its associated re-trained model are instantiated into the framework 100B in an instance in which the assessment indicates that re-training the current model using the updated training data results in improved model performance.
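Since the choice of statistical technique is left open, the following sketch shows just one possibility: a one-sided two-proportion z-test on the accuracies of the current and candidate models. It treats the two sets of results as independent samples, which is a simplification when both models see the same inputs (a paired test may be more appropriate there). The function names and the critical value are assumptions.

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """z statistic for the difference in accuracy between model B and model A."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both models all correct or all wrong: no evidence either way
    return (p_b - p_a) / se


def candidate_is_better(correct_cur, n_cur, correct_cand, n_cand, z_crit=1.645):
    # One-sided test at roughly the 5% level: promote only on a significant gain
    return two_proportion_z(correct_cur, n_cur, correct_cand, n_cand) > z_crit


clear_gain = candidate_is_better(800, 1000, 850, 1000)
marginal_gain = candidate_is_better(800, 1000, 805, 1000)
```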

FIG. 6 illustrates a third embodiment of an example system that can be configured to implement an adaptive oracle-trained learning framework 600 for automatically building and maintaining a predictive machine learning model. In embodiments, an adaptive oracle-trained learning framework 600 comprises a predictive model 630 (e.g., a classifier) that has been generated using machine learning based on a set of training data 620, and that is configured to generate a judgment about the input data 605 in response to receiving a feature representation of the input data 605; an input data analysis component 610 for generating a feature representation of the input data 605 and maintaining optimized, high-quality training data 620; a quality assurance component 660 for assessment of the quality of the input data 605 and of the quality of the judgments of the predictive model 630; an active learning component 640 to facilitate the generation and maintenance of optimized training data 620; and at least one oracle 650 (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) for providing a verified quality measure for the input data 605 and its associated judgment.

In embodiments, new unlabeled data instances 605, sharing the particular type of the examples in the training data set 620, are input to the framework 600 for processing by the predictive model 630. For example, in some embodiments, each new data instance 605 may be multi-dimensional data collected from one or more online sources describing a particular business (e.g., a restaurant, a spa), and the predictive model 630 may be a classifier that returns a judgment as to which of a set of categories the business belongs.

In embodiments, the predictive model 630 generates a judgment (e.g., an identifier of a category) in response to receiving a feature representation of an unlabeled input data instance 605. In some embodiments, the feature representation is generated during input data analysis 610 using a distribution-based feature analysis, as previously described. In some embodiments, the judgment generated by the predictive model 630 includes a confidence value. For example, in some embodiments in which the predictive model 630 is performing a classification task, the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary, as previously described with reference to FIG. 3. Classification judgments that are more certain are associated with higher confidence scores because those judgments are at greater distances in decision space from the task decision boundary.
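For intuition, a confidence score of this kind can be computed for a linear binary classifier as the distance from the feature vector to the separating hyperplane. The linear model, the function names, and the 0/1 labels are assumptions made for this sketch; the specification does not limit the classifier to a linear one.

```python
import math


def signed_distance(weights, bias, features):
    """Signed distance from features to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, features)) + bias) / norm


def judge(weights, bias, features):
    """Return (label, confidence), where confidence is distance to the boundary."""
    d = signed_distance(weights, bias, features)
    label = 1 if d >= 0 else 0
    return label, abs(d)


far_label, far_conf = judge([3.0, 4.0], 0.0, [3.0, 4.0])     # far from the boundary
near_label, near_conf = judge([3.0, 4.0], 0.0, [0.1, -0.1])  # close to the boundary
```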

In some embodiments, a quality assurance component 660 monitors the quality of the predictive model performance as well as the quality of the input data being processed. The processed data 665 and, in some embodiments, an associated judgment are output from the framework 600 if they are determined to satisfy a quality threshold.

FIG. 7 is a flow diagram of an example method 700 for adaptive maintenance of a predictive model for optimal processing of dynamic data. For convenience, the method 700 will be described with respect to a system that includes one or more computing devices and performs the method 700. Specifically, the method 700 will be described with respect to processing of dynamic data by an adaptive oracle-trained learning framework 600. For clarity and without limitation, method 700 will be described for an exemplary system in which the predictive model 630 is a trainable classifier.

In embodiments, the system receives 705 a classification judgment about an input data instance from the classifier. The judgment includes a confidence value that represents the certainty of the judgment. In some embodiments, the confidence value included with a classification judgment is a score representing the distance in decision space of the judgment from the task decision boundary, as previously described with reference to FIG. 3.

In embodiments, the system sends 710 the judgment and the input data instance to a quality assurance component 660 for quality analysis. In some embodiments, quality analysis includes determining 715 whether the judgment confidence value satisfies a confidence threshold.

In an instance in which the judgment confidence value satisfies the confidence threshold and the data satisfy a quality threshold, the system outputs 730 the data processed by the modeling task and the process ends 735.

In an instance in which the judgment confidence value does not satisfy the confidence threshold, the system sends 720 the input data sample to an oracle for verification. In some embodiments, verification by the oracle may include correction of the data, correction of the judgment, and/or labeling the input data. In response to receiving the verified data from the oracle, the system optionally may update the training data 620 using the verified data before the process ends 735. In some embodiments, updating the training data may be implemented using the quality assurance component 660 and/or the active learning component 640, which both are described in more detail with reference to FIGS. 10-12.

In some embodiments, the training data set 620 is updated continuously as new input data are processed, so that the training data reflect optimal examples of the current data being processed. The training data examples thus are adapted to fluctuations in quality and composition of the dynamic data, enabling the predictive model 630 to be re-trained. In some embodiments, the model 630 may be re-trained using the current training data set periodically or, alternatively, under a re-training schedule. In this way, a predictive model can maintain its functional effectiveness by adapting to the dynamic nature of the data being processed. Incrementally adapting an existing model is less disruptive and resource-intensive than replacing the model with a new model, and also enables a model to evolve with the dynamic data. In some embodiments, an adaptive oracle-trained learning framework 600 is further configured to perform two-sample hypothesis testing (A/B testing, hereinafter) to verify the performance of the predictive model 630 after re-training.

In some embodiments, the system performs a new distribution-based feature analysis of the training data 620 in response to the addition of newly labeled data instances. In some embodiments, for example, a new distribution-based feature analysis of the data by dynamic clustering may be performed by the input data analysis component 610 using method 800, a flow chart of which is illustrated in FIG. 8, and using method 900, a flow chart of which is illustrated in FIG. 9. Method 800 and method 900 are described in detail in U.S. patent application Ser. No. 14/038,661.

FIG. 8 is a flow diagram of an example method 800 for dynamically updating a model core group of clusters along a single dimension k. For convenience, the method 800 will be described with respect to a system that includes one or more computing devices and performs the method 800.

In embodiments, the system receives 805 X_(k), defined as a model core group of clusters 105 of objects based on a clustering dimension k. For example, in embodiments, clustering dimension k may represent a geographical feature of an object represented by latitude and longitude data. In embodiments, the system receives 810 a new data stream S_(k) representing the objects in X_(k), where the n-dimensional vector representing each object O^(i) includes the k^(th) dimension.

In embodiments, the system classifies 815 each of the objects represented in the new data stream 125 as respectively belonging to one of the clusters within X_(k). In some embodiments, an object is classified by determining, based on a k-means algorithm, C_(k), the nearest cluster to the object in the k^(th) dimension. In embodiments, classifying an object includes adding that object to the cluster C_(k).

In embodiments, the system determines 820 whether to update X_(k) in response to integrating each of the objects into its respective nearest cluster.
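Steps 815 and 820 can be pictured with a simplified sketch in which dimension k is a single scalar and each cluster is a list of values. Both simplifications are assumptions (the example dimension in the text is a latitude and longitude pair), as are the function names; the nearest-centroid rule corresponds to the assignment step of a k-means algorithm.

```python
def nearest_cluster(centroids, value):
    """Index of the centroid closest to value along one dimension."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - value))


def integrate_stream(clusters, stream):
    """Add each streamed value to its nearest cluster (clusters are mutated).

    clusters: non-empty lists of scalar values (assumed representation).
    """
    for value in stream:
        # Centroids are recomputed so earlier additions influence later ones
        centroids = [sum(c) / len(c) for c in clusters]
        clusters[nearest_cluster(centroids, value)].append(value)
    return clusters


core_group = integrate_stream([[1.0, 2.0], [10.0, 11.0]], [1.4, 9.0, 6.1])
```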

FIG. 9 is a flow diagram of an example method 900 for dynamically updating a cluster along a single dimension k. For convenience, the method 900 will be described with respect to a system that includes one or more computing devices and performs the method 900. Specifically, the method 900 will be described with respect to implementation of steps 815 and 820 of method 800.

In embodiments, the system receives 905 a data point from a new data stream S_(k) representing O^(i) _(k), an instance of clustering dimension k describing a feature of an object being described in new data stream S. For example, in embodiments, the data point may be latitude and longitude representing a geographical feature included in an n-dimensional feature vector describing the object.

In embodiments, the system adds 910 the object to the closest cluster C_(k)∈X_(k) for O^(i) _(k), and, in response, updates 915 the properties of cluster C_(k). In embodiments, updating the properties includes calculating σ_(k), the standard deviation of the objects in cluster C_(k).

In embodiments, the system determines 920 whether to update cluster C_(k) using its updated properties. In some embodiments, updating cluster C_(k) may include splitting cluster C_(k) or merging cluster C_(k) with another cluster within the core group of clusters. In some embodiments, the system determines 920 whether to update cluster C_(k) using σ_(k).
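Maintaining σ_(k) as objects arrive can be done incrementally. The sketch below uses Welford's online method for a scalar dimension; the class name `Cluster1D` and the rule that a spread above a threshold `sigma_max` suggests a split are hypothetical, since the specification states only that σ_(k) may inform the update decision.

```python
import math


class Cluster1D:
    """Running count, mean, and population standard deviation of a cluster."""

    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0

    def add(self, x: float) -> None:
        # Welford's update: numerically stable, no need to store the objects
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def sigma(self) -> float:
        return math.sqrt(self._m2 / self.n) if self.n else 0.0

    def should_split(self, sigma_max: float) -> bool:
        # Hypothetical rule: too much spread suggests splitting the cluster
        return self.sigma > sigma_max


cluster = Cluster1D()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    cluster.add(x)
```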

In some embodiments, the system may optimize an initial training data set 120 that has been generated from a pool of unlabeled data by implementing method 300 to process the initial training data set 120 using the predictive model 130 generated from the initial training data and updating the training data set 120 based on the quality assessments of the model judgments of the data instances. The system may repeat implementation of method 300 until the entire training data set meets a pre-determined quality threshold.

In some embodiments, the quality assurance component 160 is configured as a dynamic data quality assessment system described, for example, in U.S. patent application Ser. No. 14/088,247 entitled “Automated Adaptive Data Analysis Using Dynamic Data Quality Assessment,” filed on Nov. 22, 2013, and which is incorporated herein in its entirety.

FIG. 10 illustrates a diagram 1000, in which an exemplary dynamic data quality assessment system is configured as a quality assurance component 160 within an adaptive oracle-trained learning framework 100, as described in detail in U.S. patent application Ser. No. 14/088,247. The quality assurance component 160 includes a quality checker 1062 and a quality blocker 1064, and maintains a data reservoir 1050 within the framework 100.

In some embodiments, quality analysis performed by the quality assurance component 160 may include determining the effect of data quality fluctuations on the performance of the predictive model 130 generated from the training data 120, identifying input data samples that currently best represent examples of the modeled task, and modifying the training data 120 to enable the model to be improved incrementally by being re-trained with a currently optimal set of training data examples. In some embodiments, dynamic data quality assessment may be performed automatically by the quality assurance component using method 1100, a flow chart of which is illustrated in FIG. 11. Method 1100 is described in detail in U.S. patent application Ser. No. 14/088,247.

FIG. 11 is a flow diagram of an example method 1100 for automatic dynamic data quality assessment of dynamic input data being analyzed using an adaptive predictive model. For convenience, the method 1100 will be described with respect to a system that includes one or more computing devices and performs the method 1100.

For clarity and without limitation, method 1100 will be described for a scenario in which the input data sample is a sample of data collected from a data stream, and in which the predictive model is a trainable classifier, adapted based on a set of training data. In some embodiments, a data cleaning process has been applied to the input data sample. The classifier is configured to receive a feature vector representing a view of the input data sample and to output a judgment about the input data sample.

In embodiments, the system receives 1105 a judgment about an input data sample from a classifier. In some embodiments, the judgment includes a confidence value that represents a certainty of the judgment. For example, in some embodiments, the confidence value may be a score that represents the distance of the judgment from the decision boundary in decision space for the particular classification problem modeled by the classifier. The confidence score is higher (i.e., the judgment is more certain) for judgments that are further from the decision boundary.

As previously described with reference to FIG. 1A, in some embodiments, the system maintains a data reservoir of data samples that have the same data type as the input data sample and that have been processed previously by the classifier. In embodiments, the system analyzes 1110 the input data sample in terms of the summary statistics of the data reservoir and/or the judgment. In some embodiments, analysis of the judgment may include comparing a confidence value associated with the judgment to a confidence threshold and/or determining whether the judgment matches a judgment determined previously for the input sample by a method other than the classifier.

In embodiments, the system determines 1115 whether to send a quality verification request for the input data sample to an oracle based on the analysis. For example, in some embodiments, the system may determine to send a quality verification request for the input data sample if the data sample is determined statistically to be an outlier to the data samples in the data reservoir. In another example, the system may determine to send a quality verification request for the input data sample if the judgment is associated with a confidence value that is below a confidence threshold. In a third example, the system may determine to send a quality verification request for the input data sample if the judgment generated by the classifier does not match a judgment generated by another method, even if the confidence value associated with the classifier's judgment is above the confidence threshold.
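The three example triggers can be combined in a single predicate, sketched below. The scalar sample value, the z-score outlier test against reservoir summary statistics, the cutoff `z_max`, and the function name are all assumptions chosen to keep the example small.

```python
def needs_verification(value, reservoir_mean, reservoir_std,
                       confidence, confidence_threshold,
                       model_label, other_label=None, z_max=3.0):
    """Decide whether to send a quality verification request to an oracle."""
    # Trigger 1: the sample is a statistical outlier relative to the reservoir
    is_outlier = (reservoir_std > 0
                  and abs(value - reservoir_mean) / reservoir_std > z_max)
    # Trigger 2: the classifier's judgment has low confidence
    low_confidence = confidence < confidence_threshold
    # Trigger 3: another method disagrees, regardless of confidence
    disagrees = other_label is not None and other_label != model_label
    return is_outlier or low_confidence or disagrees


typical = needs_verification(5.0, 5.0, 1.0, 0.9, 0.6, "spa")
outlier = needs_verification(9.5, 5.0, 1.0, 0.9, 0.6, "spa")
uncertain = needs_verification(5.0, 5.0, 1.0, 0.4, 0.6, "spa")
conflict = needs_verification(5.0, 5.0, 1.0, 0.9, 0.6, "spa", other_label="restaurant")
```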

In an instance in which the system determines 1120 that a quality request will not be sent to the oracle, the process ends 1140.

In an instance in which the system determines 1120 that a quality request will be sent to the oracle, in some embodiments, the system may be configured to send requests to any of a group of different oracles (e.g., a crowd, a flat file of data verification results previously received from one or more oracles, and/or data verification software) and the system may select the oracle to receive the quality verification request based on attributes of the input data sample.

In response to receiving a data quality estimate of the input data sample from the oracle, in embodiments, the system determines 1125 whether to add the input data sample, its associated judgment, and its data quality estimate to the data reservoir. In some embodiments, the determination may be based on whether the input data sample statistically belongs in the data reservoir. Additionally and/or alternatively, the determination may be based on whether the judgment is associated with a high confidence value and/or matches a judgment made by a method different from the classifier (e.g., the oracle).

In an instance in which the system determines 1125 that the new data sample is not to be added to the reservoir, the process ends 1140.

In an instance in which the system determines 1125 that the new data sample is to be added to the reservoir, before the process ends 1140, the system optionally updates summary statistics for the reservoir.

In some embodiments, the generation and maintenance of an optimized training data set 120 for the predictive model 130 component of the framework is facilitated by the active learning component 140. Active learning, as described, for example, in Settles, Burr (2009), “Active Learning Literature Survey”, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, is a semi-supervised learning process in which the distribution of the training data set instances can be adjusted to optimally represent a machine learning problem.

FIG. 12 is a flow diagram of an example method 1200 for using active learning for processing potential training data for a machine-learning algorithm. For convenience, the method 1200 will be described with respect to a system that includes one or more computing devices and performs the method 1200. Specifically, the method 1200 will be described with respect to processing of dynamic data by the active learning component 140 of an adaptive oracle-trained learning framework 100. For clarity and without limitation, method 1200 will be described for an exemplary system in which the machine-learning algorithm is a trainable classifier.

In embodiments, the system receives 1205 an input data sample and its associated judgment that includes a confidence value determined to not satisfy a confidence threshold.

A machine-learning algorithm may achieve greater accuracy with fewer training labels if the training data set instances are chosen to provide maximum information about the problem. Referring to the classifier example, data instances that provide maximum information about the classification task are data instances that result in classifier judgments that are closer to the decision boundary. In some embodiments, these data instances may be recognized automatically because their judgments are associated with lower confidence scores, as previously described.

In embodiments, the system sends 1210 the input data sample to an oracle for verification. In some embodiments, verification by the oracle may include correction of the data, correction of the judgment, and/or labeling the input data.

In embodiments, the system optionally may update 1215 the training data 120 using the verified data. Thus, the system can leverage the classifier's performance in real time or near real time to adapt the training data set to include a higher frequency of examples that currently result in judgments having the greatest uncertainty.

In embodiments, a dynamic data quality assessment system 160 may complement an active learning component 140 to ensure that any modifications of the training data by adding new samples to the training data set do not result in over-fitting the model to the problem.

FIG. 13 is an illustration 1300 of the different effects of active learning and dynamic data quality assessment on selection of new data samples to be added to an exemplary training data set for a binary classification model. A model (i.e., a binary classifier) assigns a judgment value 1310 to each data point; a data point assigned a judgment value that is close to either 0 or 1 has been determined with certainty by the classifier to belong to one or the other of two classes. A judgment value of 0.5 represents a situation in which the classification decision was not certain; an input data sample assigned a judgment value close to 0.5 by the classifier represents a judgment that is close to the decision boundary 1315 for the classification task.

The dashed curve 1340 represents the relative frequencies of new training data samples that would be added to a training data set for this binary classification problem by an active learning component. To enhance the performance of the classifier in situations where the decision was uncertain, the active learning component would choose the majority of new training data samples from input data that resulted in decisions near the decision boundary 1315.

The solid curve 1330 represents the relative frequencies of new training data samples that would be added to the training data set by dynamic quality assessment. Instead of choosing new training data samples based on the judgment value, in some embodiments, dynamic quality assessment may choose the majority of new training data samples based on whether they statistically belong in the data reservoir. It also may choose to add new training data samples that were classified with certainty (i.e., having a judgment value close to either 0 or 1), but erroneously (e.g., samples in which the judgment result from the classifier did not match the result returned from the oracle).

FIG. 14 shows a schematic block diagram of circuitry 1400, some or all of which may be included in, for example, an adaptive oracle-trained learning framework 100. As illustrated in FIG. 14, in accordance with some example embodiments, circuitry 1400 can include various means, such as processor 1402, memory 1404, communications module 1406, and/or input/output module 1408. As referred to herein, “module” includes hardware, software and/or firmware configured to perform one or more particular functions. In this regard, the means of circuitry 1400 as described herein may be embodied as, for example, circuitry, hardware elements (e.g., a suitably programmed processor, combinational logic circuit, and/or the like), a computer program product comprising computer-readable program instructions stored on a non-transitory computer-readable medium (e.g., memory 1404) that is executable by a suitably configured processing device (e.g., processor 1402), or some combination thereof.

Processor 1402 may, for example, be embodied as various means including one or more microprocessors with accompanying digital signal processor(s), one or more processor(s) without an accompanying digital signal processor, one or more coprocessors, one or more multi-core processors, one or more controllers, processing circuitry, one or more computers, various other processing elements including integrated circuits such as, for example, an ASIC (application specific integrated circuit) or FPGA (field programmable gate array), or some combination thereof. Accordingly, although illustrated in FIG. 14 as a single processor, in some embodiments, processor 1402 comprises a plurality of processors. The plurality of processors may be embodied on a single computing device or may be distributed across a plurality of computing devices collectively configured to function as circuitry 1400. The plurality of processors may be in operative communication with each other and may be collectively configured to perform one or more functionalities of circuitry 1400 as described herein. In an example embodiment, processor 1402 is configured to execute instructions stored in memory 1404 or otherwise accessible to processor 1402. These instructions, when executed by processor 1402, may cause circuitry 1400 to perform one or more of the functionalities of circuitry 1400 as described herein.

Whether configured by hardware, firmware/software methods, or by a combination thereof, processor 1402 may comprise an entity capable of performing operations according to embodiments of the present invention while configured accordingly. Thus, for example, when processor 1402 is embodied as an ASIC, FPGA or the like, processor 1402 may comprise specifically configured hardware for conducting one or more operations described herein. Alternatively, as another example, when processor 1402 is embodied as an executor of instructions, such as may be stored in memory 1404, the instructions may specifically configure processor 1402 to perform one or more algorithms and operations described herein, such as those discussed in connection with FIGS. 1-12.

Memory 1404 may comprise, for example, volatile memory, non-volatile memory, or some combination thereof. Although illustrated in FIG. 14 as a single memory, memory 1404 may comprise a plurality of memory components. The plurality of memory components may be embodied on a single computing device or distributed across a plurality of computing devices. In various embodiments, memory 1404 may comprise, for example, a hard disk, random access memory, cache memory, flash memory, a compact disc read only memory (CD-ROM), digital versatile disc read only memory (DVD-ROM), an optical disc, circuitry configured to store information, or some combination thereof. Memory 1404 may be configured to store information, data (including analytics data), applications, instructions, or the like for enabling circuitry 1400 to carry out various functions in accordance with example embodiments of the present invention. For example, in at least some embodiments, memory 1404 is configured to buffer input data for processing by processor 1402. Additionally or alternatively, in at least some embodiments, memory 1404 is configured to store program instructions for execution by processor 1402. Memory 1404 may store information in the form of static and/or dynamic information. This stored information may be stored and/or used by circuitry 1400 during the course of performing its functionalities.

Communications module 1406 may be embodied as any device or means embodied in circuitry, hardware, a computer program product comprising computer readable program instructions stored on a computer readable medium (e.g., memory 1404) and executed by a processing device (e.g., processor 1402), or a combination thereof that is configured to receive and/or transmit data from/to another device, such as, for example, a second circuitry 1400 and/or the like. In some embodiments, communications module 1406 (like other components discussed herein) can be at least partially embodied as or otherwise controlled by processor 1402. In this regard, communications module 1406 may be in communication with processor 1402, such as via a bus. Communications module 1406 may include, for example, an antenna, a transmitter, a receiver, a transceiver, network interface card and/or supporting hardware and/or firmware/software for enabling communications with another computing device. Communications module 1406 may be configured to receive and/or transmit any data that may be stored by memory 1404 using any protocol that may be used for communications between computing devices. Communications module 1406 may additionally or alternatively be in communication with the memory 1404, input/output module 1408 and/or any other component of circuitry 1400, such as via a bus.

Input/output module 1408 may be in communication with processor 1402 to receive an indication of a user input and/or to provide an audible, visual, mechanical, or other output to a user. Some example visual outputs that may be provided to a user by circuitry 1400 are discussed in connection with FIG. 1A. As such, input/output module 1408 may include support, for example, for a keyboard, a mouse, a joystick, a display, a touch screen display, a microphone, a speaker, an RFID reader, barcode reader, biometric scanner, and/or other input/output mechanisms. In embodiments wherein circuitry 1400 is embodied as a server or database, aspects of input/output module 1408 may be reduced as compared to embodiments where circuitry 1400 is implemented as an end-user machine or other type of device designed for complex user interactions. In some embodiments (like other components discussed herein), input/output module 1408 may even be eliminated from circuitry 1400. Alternatively, such as in embodiments wherein circuitry 1400 is embodied as a server or database, at least some aspects of input/output module 1408 may be embodied on an apparatus used by a user that is in communication with circuitry 1400. Input/output module 1408 may be in communication with the memory 1404, communications module 1406, and/or any other component(s), such as via a bus. Although more than one input/output module and/or other component can be included in circuitry 1400, only one is shown in FIG. 14 to avoid overcomplicating the drawing (like the other components discussed herein).

Adaptive learning module 1410 may also or instead be included and configured to perform the functionality discussed herein related to the adaptive learning crowd-based framework discussed above. In some embodiments, some or all of the functionality of adaptive learning may be performed by processor 1402. In this regard, the example processes and algorithms discussed herein can be performed by at least one processor 1402 and/or adaptive learning module 1410. For example, non-transitory computer readable media can be configured to store firmware, one or more application programs, and/or other software, which include instructions and other computer-readable program code portions that can be executed to control each processor (e.g., processor 1402 and/or adaptive learning module 1410) of the components of system 400 to implement various operations, including the examples shown above. As such, a series of computer-readable program code portions are embodied in one or more computer program products and can be used, with a computing device, server, and/or other programmable apparatus, to produce machine-implemented processes.

Any such computer program instructions and/or other type of code may be loaded onto a computer, processor or other programmable apparatus's circuitry to produce a machine, such that the computer, processor, or other programmable circuitry that executes the code on the machine creates the means for implementing various functions, including those described herein.

It is also noted that all or some of the information presented by the example displays discussed herein can be based on data that is received, generated and/or maintained by one or more components of adaptive oracle-trained learning framework 100. In some embodiments, one or more external systems (such as a remote cloud computing and/or data storage system) may also be leveraged to provide at least some of the functionality discussed herein.

As described above in this disclosure, aspects of embodiments of the present invention may be configured as methods, mobile devices, backend network devices, and the like. Accordingly, embodiments may comprise various means, including entirely hardware or any combination of software and hardware. Furthermore, embodiments may take the form of a computer program product on at least one non-transitory computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium. Any suitable computer-readable storage medium may be utilized including non-transitory hard disks, CD-ROMs, flash memory, optical storage devices, or magnetic storage devices.

Embodiments of the present invention have been described above with reference to block diagrams and flowchart illustrations of methods, apparatuses, systems and computer program products. It will be understood that each block of the circuit diagrams and process flow diagrams, and combinations of blocks in the circuit diagrams and process flowcharts, respectively, can be implemented by various means including computer program instructions. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus, such as processor 1402 and/or adaptive learning module 1410 discussed above with reference to FIG. 14, to produce a machine, such that the computer program product includes the instructions which, when executed on the computer or other programmable data processing apparatus, create a means for implementing the functions specified in the flowchart block or blocks.

These computer program instructions may also be stored in a computer-readable storage device (e.g., memory 1404) that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable storage device produce an article of manufacture including computer-readable instructions for implementing the function discussed herein. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions discussed herein.

Accordingly, blocks of the block diagrams and flowchart illustrations support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the circuit diagrams and process flowcharts, and combinations of blocks in the circuit diagrams and process flowcharts, can be implemented by special purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.

Many modifications and other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these inventions pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the inventions are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

1. A system comprising one or more computers configured to implement an adaptive learning framework for automatically building and maintaining a predictive model for processing dynamic data, wherein the adaptive learning framework is configured to include: the predictive model, wherein the model is configured to generate model output from processing an input data instance received by the adaptive learning framework, and wherein the model output includes a judgment and a confidence value representing certainty of the judgment; a training data set from which the predictive model is derived using machine learning; and a training data manager, wherein the training data manager is configured for curating the training data set; and a labeled data reservoir configured to store labeled data instances that have been processed by the predictive model, wherein the labeled data reservoir includes a pool of possible training data, wherein the set of labeled data instances are not included in the training data; and wherein each labeled data instance is associated with a true label representing the instance; and wherein the training data manager is configured to perform operations comprising: determining whether to update the training data set; in an instance in which the training data set is to be updated, selecting a set of labeled data instances from a labeled data reservoir; and updating the training data using the set of labeled data instances.
2. The system of claim 1, wherein determining whether to update the training data set is based at least in part on analyzing the distribution and quality of the training data.
3. The system of claim 1, wherein determining whether to update the training data set is based at least in part on an accuracy assessment of the model performance.
4. The system of claim 3, wherein the accuracy assessment is based on determining whether the confidence value satisfies a confidence threshold value.
5. The system of claim 2, wherein the current model is a classifier predicting to which of a set of predictive categories an input data instance belongs, wherein a true label associated with a labeled data instance identifies the predictive category to which the labeled data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir is based at least in part on maintaining a class balance within the training data.
6. The system of claim 1, wherein the labeled data reservoir includes labeled data instances that are received from multiple sources, and wherein selecting a labeled data instance from the set of labeled data instances comprises: comparing a source of the labeled data instance with a pre-determined source; and selecting the labeled data instance in an instance in which the source of the labeled data instance matches the pre-determined source.
7. The system of claim 1, wherein the operations further comprise: in response to updating the training data, determining whether to re-train the model; in an instance in which the model is re-trained, generating at least one candidate training data set using the updated training data; deriving a candidate model using the candidate training data set; generating an assessment of whether the candidate model performance is improved from the model performance; and instantiating the candidate training data set and the candidate model in the adaptive learning framework in an instance in which the candidate model performance is improved from the model performance.
8. The system of claim 7, wherein generating the assessment of whether the candidate model performance is improved from the model performance includes A/B testing.
9. The system of claim 8, wherein generating the assessment comprises calculating a cross-validation between the candidate model performance and the model performance.
10. The system of claim 8, wherein there are multiple candidate models, and wherein generating the assessment respectively for each of the multiple candidate models is implemented in parallel.
11. A computer program product, stored on a non-transitory computer readable medium, comprising instructions that when executed on one or more computers cause the one or more computers to implement an adaptive learning framework for automatically building and maintaining a predictive model for processing dynamic data, wherein the adaptive learning framework is configured to include: the predictive model, wherein the model is configured to generate model output from processing an input data instance received by the adaptive learning framework, and wherein the model output includes a judgment and a confidence value representing certainty of the judgment; a training data set from which the predictive model is derived using machine learning; and a training data manager, wherein the training data manager is configured for curating the training data set; and a labeled data reservoir configured to store labeled data instances that have been processed by the predictive model, wherein the labeled data reservoir includes a pool of possible training data, wherein the set of labeled data instances are not included in the training data; and wherein each labeled data instance is associated with a true label representing the instance; and wherein the training data manager is configured to perform operations comprising: determining whether to update the training data set; in an instance in which the training data set is to be updated, selecting a set of labeled data instances from a labeled data reservoir; and updating the training data using the set of labeled data instances.
12. The computer program product of claim 11, wherein determining whether to update the training data set is based at least in part on analyzing the distribution and quality of the training data.
13. The computer program product of claim 11, wherein determining whether to update the training data set is based at least in part on an accuracy assessment of the model performance.
14. The computer program product of claim 13, wherein the accuracy assessment is based on determining whether the confidence value satisfies a confidence threshold value.
15. The computer program product of claim 12, wherein the current model is a classifier predicting to which of a set of predictive categories an input data instance belongs, wherein a true label associated with a labeled data instance identifies the predictive category to which the labeled data instance belongs, and wherein selecting the set of labeled data instances from the labeled data reservoir is based at least in part on maintaining a class balance within the training data.
16. The computer program product of claim 11, wherein the labeled data reservoir includes labeled data instances that are received from multiple sources, and wherein selecting a labeled data instance from the set of labeled data instances comprises: comparing a source of the labeled data instance with a pre-determined source; and selecting the labeled data instance in an instance in which the source of the labeled data instance matches the pre-determined source.
17. The computer program product of claim 11, wherein the operations further comprise: in response to updating the training data, determining whether to re-train the model; in an instance in which the model is re-trained, generating at least one candidate training data set using the updated training data; deriving a candidate model using the candidate training data set; generating an assessment of whether the candidate model performance is improved from the model performance; and instantiating the candidate training data set and the candidate model in the adaptive learning framework in an instance in which the candidate model performance is improved from the model performance.
18. The computer program product of claim 17, wherein generating the assessment of whether the candidate model performance is improved from the model performance includes A/B testing.
19. The computer program product of claim 18, wherein generating the assessment comprises calculating a cross-validation between the candidate model performance and the model performance.
20. The computer program product of claim 18, wherein there are multiple candidate models, and wherein generating the assessment respectively for each of the multiple candidate models is implemented in parallel.
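
By way of non-limiting illustration only, the training data manager operations recited above (determining whether to update the training data set, selecting labeled data instances from the labeled data reservoir, and updating the training data) might be sketched in Python as follows. The sketch is not part of the claims and does not define them. All names (`LabeledInstance`, `should_update`, `select_for_balance`, `update_training_data`), the default threshold values, and the particular rules used (updating when too small a fraction of confidence values meets the threshold, and balancing classes by topping up toward the largest class count) are assumptions chosen for the example, not details taken from the disclosure.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledInstance:
    """Hypothetical record for one labeled data instance in the reservoir."""
    features: tuple
    true_label: str
    source: str


def should_update(confidences, threshold=0.8, min_fraction=0.9):
    """Assumed rule: update when fewer than `min_fraction` of the model's
    confidence values satisfy the confidence threshold."""
    if not confidences:
        return False
    meeting = sum(1 for c in confidences if c >= threshold)
    return meeting / len(confidences) < min_fraction


def select_for_balance(training, reservoir, allowed_source=None):
    """Select reservoir instances that move under-represented classes toward
    the size of the largest class, optionally keeping only one source."""
    counts = Counter(inst.true_label for inst in training)
    target = max(counts.values(), default=0)
    selected = []
    for inst in reservoir:
        # Source check: keep only instances from the pre-determined source.
        if allowed_source is not None and inst.source != allowed_source:
            continue
        # Skip anything already present in the training data.
        if inst in training:
            continue
        # Class balance: add only while the class is below the target count.
        if counts[inst.true_label] < target:
            selected.append(inst)
            counts[inst.true_label] += 1
    return selected


def update_training_data(training, reservoir, confidences, allowed_source=None):
    """Return a new training list; unchanged when no update is warranted."""
    if not should_update(confidences):
        return list(training)
    return list(training) + select_for_balance(training, reservoir, allowed_source)
```

In this sketch the reservoir is scanned in order, so earlier instances are preferred; a different selection policy (for example, random sampling or recency weighting) could be substituted without changing the overall flow.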